Collect Delta extended statistics during insert #16026
ebyhr merged 4 commits into trinodb:master from pajaks:pajaks/analyze_on_insert
Conversation
maxFileModificationTime probably doesn't need to be optional, since for an empty insert there should be no stats collected.
@findinpath does empty insert still create a transaction log entry?
I checked, and an empty insert does create a transaction log entry in that case.
I will add a check for that case, but I would leave Optional here to match the updateTableStatistics arguments, since in general maxFileModificationTime can be empty (for example, during ANALYZE, when it can't be retrieved from computed statistics).
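A minimal sketch of the point being made — all names here are illustrative, not Trino's actual code: the common stats-update entry point takes an Optional timestamp because ANALYZE may not be able to recover it, while an INSERT derives it from the files it wrote (and an empty insert has none to derive).

```java
import java.time.Instant;
import java.util.List;
import java.util.Optional;

public class StatsUpdateSketch
{
    // For ANALYZE the timestamp may not be recoverable from the computed
    // statistics, so the shared entry point accepts an Optional.
    static String updateTableStatistics(Optional<Instant> maxFileModificationTime)
    {
        return maxFileModificationTime
                .map(timestamp -> "stats with timestamp " + timestamp)
                .orElse("stats without timestamp");
    }

    // An INSERT can pass the max modification time of the files it wrote;
    // an empty insert wrote no files, so there is nothing to pass.
    static Optional<Instant> maxModificationTime(List<Instant> writtenFileTimes)
    {
        return writtenFileTimes.stream().max(Instant::compareTo);
    }

    public static void main(String[] args)
    {
        // Empty insert: no files written.
        System.out.println(updateTableStatistics(maxModificationTime(List.of())));
        // Regular insert: two files written.
        System.out.println(updateTableStatistics(maxModificationTime(
                List.of(Instant.ofEpochSecond(100), Instant.ofEpochSecond(200)))));
    }
}
```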
@findinpath does empty insert still create a transaction log entry?
I remember doing something like this for Iceberg:
I don't think we did this for Delta Lake.
We should probably add an issue to handle this aspect as well on Delta.
Do we have coverage for ANALYZE filling in the NDV information if it was not present before?
Should we run the above INSERT with stats collection disabled?
I think the NDV information update by ANALYZE is covered in testStatisticsOnInsertWhenCollectionOnWriteDisabled.
Filling previously empty NDV with ANALYZE is covered in testCreateTableStatisticsWhenCollectionOnWriteDisabled.
nit: we put all arguments on one line, or each on separate line
Excluding a column when the existing stats don't have an entry for it seems like the right thing to do, but we might want to try to do better. We can't tell the difference right now between two things:
If we know the column is new and that's why we don't have stats for it, we could include it in the stats during the next insert.
In ANALYZE WITH, the user specifies the columns to be analyzed. So it's more include than exclude.
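The current behavior can be sketched as follows — a hypothetical helper, not Trino's actual API: a column missing from the previously analyzed set is skipped, while an absent restriction means all columns are collected. Distinguishing "newly added column" from "deliberately excluded column" would require extra information this check doesn't have.

```java
import java.util.Optional;
import java.util.Set;

public class ColumnSelectionSketch
{
    // An empty Optional means ANALYZE never restricted the column list,
    // so stats are collected for every column.
    static boolean collectStatsFor(String column, Optional<Set<String>> analyzedColumns)
    {
        return analyzedColumns.map(columns -> columns.contains(column)).orElse(true);
    }

    public static void main(String[] args)
    {
        Optional<Set<String>> analyzed = Optional.of(Set.of("nationkey", "regionkey"));
        System.out.println(collectStatsFor("nationkey", analyzed));
        // A column absent from the set is excluded, whether it was newly
        // added or deliberately left out of ANALYZE.
        System.out.println(collectStatsFor("name", analyzed));
        System.out.println(collectStatsFor("name", Optional.empty()));
    }
}
```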
Can you please create a ticket to track this issue?
separate commit - unrelated to the current commit
I find that disabling stats collection is a pragmatic choice here, given that the table had no stats collected before operations were done on it (it is being registered). No change requested.
Or we could update the assertions to include stats; that seems like a better approach to me.
Would it make sense to verify on the file level whether the extended_stats.json file remains untouched?
pre-existing:
Running
assertUpdate(format("ANALYZE %s WITH(columns = ARRAY['nationkey', 'regionkey'])", tableName));
succeeds (as expected).
Now running any of the commands:
assertUpdate(format("ANALYZE %s WITH(columns = ARRAY['nationkey'])", tableName));
or
assertUpdate(format("ANALYZE %s", tableName));
fails with the message:
List of columns to be analyzed must be a subset of previously used. To extend list of analyzed columns drop table statistics
Now the user needs to know exactly which columns were analyzed before in order to match them in the ANALYZE statement.
Maybe we should add more information to the exception message (in a separate PR) about the columns that currently have stats.
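The suggested error-message improvement could look roughly like this — hypothetical code, not Trino's implementation. In the observed behavior the new ANALYZE column list has to match the previously analyzed columns, so appending those columns to the message would save the user a lookup.

```java
import java.util.Set;

public class AnalyzeColumnsCheck
{
    static boolean matchesAnalyzed(Set<String> requested, Set<String> previouslyAnalyzed)
    {
        return requested.equals(previouslyAnalyzed);
    }

    static void checkAnalyzedColumns(Set<String> requested, Set<String> previouslyAnalyzed)
    {
        if (!matchesAnalyzed(requested, previouslyAnalyzed)) {
            // Same message as today, extended with the columns that
            // currently have statistics.
            throw new IllegalArgumentException(
                    "List of columns to be analyzed must be a subset of previously used. "
                    + "To extend list of analyzed columns drop table statistics. "
                    + "Columns with statistics: " + previouslyAnalyzed);
        }
    }

    public static void main(String[] args)
    {
        Set<String> analyzed = Set.of("nationkey", "regionkey");
        checkAnalyzedColumns(Set.of("nationkey", "regionkey"), analyzed); // succeeds
        try {
            checkAnalyzedColumns(Set.of("nationkey"), analyzed); // fails
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```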
The comment change should move to the "Remove split count verification for ANALYZE" commit.
Should this move to the "Remove split count verification for ANALYZE" commit?
This check is added to test the functionality, so I think it should stay in this commit.
verifySplitCount is unused and can be removed.
getOperatorStats can also be removed.
nit: can we make the table name prefix match the test name?
nit: follow the same convention in the other test methods
alexjo2144
left a comment
Overall LGTM. We mentioned offline that a test which adds a column and then does an insert should include stats for the new column. Would be nice to have.
nit TODO: https://github.com/trinodb/trino/issues/16088 -> TODO (https://github.com/trinodb/trino/issues/16088)
I guess this is not relevant for finishInsert, but eventually, for finishMerge (follow-up PR), take into account that you'll need to filter for data files only: dataFile.getDataFileType() == DATA
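A short sketch of that filter — the enum, record, and class names here are illustrative stand-ins, not Trino's actual classes: only entries whose type is DATA should feed the statistics, since a MERGE can also produce other file types (e.g. change-data-feed files).

```java
import java.util.List;
import java.util.stream.Collectors;

public class MergeFileFilterSketch
{
    // Illustrative file types; a MERGE may emit more than plain data files.
    enum DataFileType { DATA, CHANGE_DATA_FEED }

    record DataFile(String path, DataFileType dataFileType) {}

    // Keep only plain data files before feeding them into stats collection.
    static List<DataFile> dataFilesOnly(List<DataFile> files)
    {
        return files.stream()
                .filter(file -> file.dataFileType() == DataFileType.DATA)
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        List<DataFile> files = List.of(
                new DataFile("part-0.parquet", DataFileType.DATA),
                new DataFile("cdf-0.parquet", DataFileType.CHANGE_DATA_FEED));
        System.out.println(dataFilesOnly(files).size());
    }
}
```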
nit: can be inlined in the assignment for analyzedColumns
It turned out that an added column is not added to statistics even for legacy ANALYZE. I think it should be handled in a different PR (proposed solution: #16109).
Sounds good to me. We do need to decide if we merge this before support for column mapping and adding/dropping columns. Thinking about it: if a column is dropped and added back with the same name, our stats should be reset instead of reviving the old data.
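One way to frame the drop-and-re-add concern, as a hypothetical sketch (all names are made up, not Trino's code): if stats are keyed only by column name, a re-added column would revive stale statistics; keying by a stable column id, as column mapping provides, avoids that.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ColumnIdentityStatsSketch
{
    // A stable fieldId distinguishes a re-added column from the original
    // column of the same name.
    record ColumnHandle(String name, int fieldId) {}

    private final Map<Integer, Long> ndvByFieldId = new HashMap<>();

    void recordNdv(ColumnHandle column, long ndv)
    {
        ndvByFieldId.put(column.fieldId(), ndv);
    }

    Optional<Long> ndv(ColumnHandle column)
    {
        return Optional.ofNullable(ndvByFieldId.get(column.fieldId()));
    }

    public static void main(String[] args)
    {
        ColumnIdentityStatsSketch stats = new ColumnIdentityStatsSketch();
        stats.recordNdv(new ColumnHandle("comment", 3), 25);

        // Dropped and re-added with the same name: the new field id means
        // the old NDV is not revived.
        System.out.println(stats.ndv(new ColumnHandle("comment", 7)).isEmpty());
    }
}
```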
Rebased to resolve conflict with #16108.

CI hit: #11131

Merged, thanks!
Description
Collect Delta Lake statistics for INSERT.
Additional context and related issues
Release notes
(x) Release notes are required, with the following suggested text: